<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="generator" content="Docutils 0.6: http://docutils.sourceforge.net/" />
<meta name="version" content="S5 1.1" />
<title>Automatic Machine Translation Evaluation</title>
<style type="text/css">

/*
:Author: David Goodger (goodger@python.org)
:Id: $Id: html4css1.css 5951 2009-05-18 18:03:10Z milde $
:Copyright: This stylesheet has been placed in the public domain.

Default cascading style sheet for the HTML output of Docutils.

See http://docutils.sf.net/docs/howto/html-stylesheets.html for how to
customize this style sheet.
*/

/* used to remove borders from tables and images */
.borderless, table.borderless td, table.borderless th {
  border: 0 }

table.borderless td, table.borderless th {
  /* Override padding for "table.docutils td" with "! important".
     The right padding separates the table cells. */
  padding: 0 0.5em 0 0 ! important }

.first {
  /* Override more specific margin styles with "! important". */
  margin-top: 0 ! important }

.last, .with-subtitle {
  margin-bottom: 0 ! important }

.hidden {
  display: none }

a.toc-backref {
  text-decoration: none ;
  color: black }

blockquote.epigraph {
  margin: 2em 5em ; }

dl.docutils dd {
  margin-bottom: 0.5em }

/* Uncomment (and remove this text!) to get bold-faced definition list terms
dl.docutils dt {
  font-weight: bold }
*/

div.abstract {
  margin: 2em 5em }

div.abstract p.topic-title {
  font-weight: bold ;
  text-align: center }

div.admonition, div.attention, div.caution, div.danger, div.error,
div.hint, div.important, div.note, div.tip, div.warning {
  margin: 2em ;
  border: medium outset ;
  padding: 1em }

div.admonition p.admonition-title, div.hint p.admonition-title,
div.important p.admonition-title, div.note p.admonition-title,
div.tip p.admonition-title {
  font-weight: bold ;
  font-family: sans-serif }

div.attention p.admonition-title, div.caution p.admonition-title,
div.danger p.admonition-title, div.error p.admonition-title,
div.warning p.admonition-title {
  color: red ;
  font-weight: bold ;
  font-family: sans-serif }

/* Uncomment (and remove this text!) to get reduced vertical space in
   compound paragraphs.
div.compound .compound-first, div.compound .compound-middle {
  margin-bottom: 0.5em }

div.compound .compound-last, div.compound .compound-middle {
  margin-top: 0.5em }
*/

div.dedication {
  margin: 2em 5em ;
  text-align: center ;
  font-style: italic }

div.dedication p.topic-title {
  font-weight: bold ;
  font-style: normal }

div.figure {
  margin-left: 2em ;
  margin-right: 2em }

div.footer, div.header {
  clear: both;
  font-size: smaller }

div.line-block {
  display: block ;
  margin-top: 1em ;
  margin-bottom: 1em }

div.line-block div.line-block {
  margin-top: 0 ;
  margin-bottom: 0 ;
  margin-left: 1.5em }

div.sidebar {
  margin: 0 0 0.5em 1em ;
  border: medium outset ;
  padding: 1em ;
  background-color: #ffffee ;
  width: 40% ;
  float: right ;
  clear: right }

div.sidebar p.rubric {
  font-family: sans-serif ;
  font-size: medium }

div.system-messages {
  margin: 5em }

div.system-messages h1 {
  color: red }

div.system-message {
  border: medium outset ;
  padding: 1em }

div.system-message p.system-message-title {
  color: red ;
  font-weight: bold }

div.topic {
  margin: 2em }

h1.section-subtitle, h2.section-subtitle, h3.section-subtitle,
h4.section-subtitle, h5.section-subtitle, h6.section-subtitle {
  margin-top: 0.4em }

h1.title {
  text-align: center }

h2.subtitle {
  text-align: center }

hr.docutils {
  width: 75% }

img.align-left, .figure.align-left{
  clear: left ;
  float: left ;
  margin-right: 1em }

img.align-right, .figure.align-right {
  clear: right ;
  float: right ;
  margin-left: 1em }

.align-left {
  text-align: left }

.align-center {
  clear: both ;
  text-align: center }

.align-right {
  text-align: right }

/* reset inner alignment in figures */
div.align-right {
  text-align: left }

/* div.align-center * { */
/*   text-align: left } */

ol.simple, ul.simple {
  margin-bottom: 1em }

ol.arabic {
  list-style: decimal }

ol.loweralpha {
  list-style: lower-alpha }

ol.upperalpha {
  list-style: upper-alpha }

ol.lowerroman {
  list-style: lower-roman }

ol.upperroman {
  list-style: upper-roman }

p.attribution {
  text-align: right ;
  margin-left: 50% }

p.caption {
  font-style: italic }

p.credits {
  font-style: italic ;
  font-size: smaller }

p.label {
  white-space: nowrap }

p.rubric {
  font-weight: bold ;
  font-size: larger ;
  color: maroon ;
  text-align: center }

p.sidebar-title {
  font-family: sans-serif ;
  font-weight: bold ;
  font-size: larger }

p.sidebar-subtitle {
  font-family: sans-serif ;
  font-weight: bold }

p.topic-title {
  font-weight: bold }

pre.address {
  margin-bottom: 0 ;
  margin-top: 0 ;
  font: inherit }

pre.literal-block, pre.doctest-block {
  margin-left: 2em ;
  margin-right: 2em }

span.classifier {
  font-family: sans-serif ;
  font-style: oblique }

span.classifier-delimiter {
  font-family: sans-serif ;
  font-weight: bold }

span.interpreted {
  font-family: sans-serif }

span.option {
  white-space: nowrap }

span.pre {
  white-space: pre }

span.problematic {
  color: red }

span.section-subtitle {
  /* font-size relative to parent (h1..h6 element) */
  font-size: 80% }

table.citation {
  border-left: solid 1px gray;
  margin-left: 1px }

table.docinfo {
  margin: 2em 4em }

table.docutils {
  margin-top: 0.5em ;
  margin-bottom: 0.5em }

table.footnote {
  border-left: solid 1px black;
  margin-left: 1px }

table.docutils td, table.docutils th,
table.docinfo td, table.docinfo th {
  padding-left: 0.5em ;
  padding-right: 0.5em ;
  vertical-align: top }

table.docutils th.field-name, table.docinfo th.docinfo-name {
  font-weight: bold ;
  text-align: left ;
  white-space: nowrap ;
  padding-left: 0 }

h1 tt.docutils, h2 tt.docutils, h3 tt.docutils,
h4 tt.docutils, h5 tt.docutils, h6 tt.docutils {
  font-size: 100% }

ul.auto-toc {
  list-style-type: none }

</style>
<!-- configuration parameters -->
<meta name="defaultView" content="slideshow" />
<meta name="controlVis" content="hidden" />
<!-- style sheet links -->
<script src="ui/default/slides.js" type="text/javascript"></script>
<link rel="stylesheet" href="ui/default/slides.css"
      type="text/css" media="projection" id="slideProj" />
<link rel="stylesheet" href="ui/default/outline.css"
      type="text/css" media="screen" id="outlineStyle" />
<link rel="stylesheet" href="ui/default/print.css"
      type="text/css" media="print" id="slidePrint" />
<link rel="stylesheet" href="ui/default/opera.css"
      type="text/css" media="projection" id="operaFix" />

<style type="text/css">
#currentSlide {display: none;}
</style>
</head>
<body>
<div class="layout">
<div id="controls"></div>
<div id="currentSlide"></div>
<div id="header">

</div>
<div id="footer">
<h1>Automatic Machine Translation Evaluation</h1>

</div>
</div>
<div class="presentation">
<div class="slide" id="slide0">
<h1 class="title">Automatic Machine Translation Evaluation</h1>


</div>
<div class="slide" id="rationale">
<h1>1. Rationale</h1>
<p>The closer a machine translation is to a professional
human translation, the better it is.</p>
<p>To judge the quality of a translation, we need:</p>
<ol class="arabic simple">
<li>a corpus of good-quality human reference translations;</li>
<li>a numerical &quot;translation closeness&quot; metric.</li>
</ol>
</div>
<div class="slide" id="baseline-bleu-metric">
<h1>2. Baseline BLEU Metric</h1>
</div>
<div class="slide" id="modified-n-gram-precision">
<h1>Modified n-gram precision</h1>
<p>Count matches from:</p>
<ul class="simple">
<li>n-gram of the candidate sentence</li>
<li>n-gram of the reference sentence</li>
</ul>
<p>modified precision = sum(clipped candidate n-gram counts) / (# n-grams in the candidate sentence).</p>
<p>clipped n-gram count = min(count, max_ref_count), where max_ref_count is the
largest count of that n-gram in any single reference.</p>
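<p>A minimal Python sketch of this clipping rule (the function names and whitespace tokenization are my own, not from the paper):</p>

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it occurs in any single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    max_ref = Counter()
    for ref in references:
        for gram, cnt in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], cnt)
    clipped = sum(min(cnt, max_ref[gram]) for gram, cnt in cand_counts.items())
    return clipped / max(sum(cand_counts.values()), 1)
```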
</div>
<div class="slide" id="id1">
<h1>Modified n-gram precision</h1>
<p>These matches are position-independent.</p>
<p>The more matches, the better the candidate translation.</p>
<p>This sort of modified n-gram precision scoring captures two aspects of
translation: adequacy &amp; fluency.</p>
</div>
<div class="slide" id="examples">
<h1>Examples</h1>
<p>Refer to example 1 and 2 in the handout.</p>
</div>
<div class="slide" id="blocks-of-text">
<h1>Blocks of text</h1>
<p>Basic unit is sentence.</p>
<pre class="literal-block">
       sum of clipped n-gram counts over all candidate sentences in the corpus
p_n = -------------------------------------------------------------------------
                   total # of candidate n-grams in the corpus
</pre>
<p>(formula will be written on board)</p>
</div>
<div class="slide" id="blocks-of-text-example">
<h1>Blocks of text: example</h1>
<div class="line-block">
<div class="line">candidate 1: the the the the the the the</div>
<div class="line">candidate 2: I am doing a presentation.</div>
</div>
<div class="line-block">
<div class="line">reference 1: The cat is on the mat.</div>
<div class="line">reference 2: There is a cat on the mat.</div>
</div>
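<p>Working candidate 1 through by hand (a Python sketch; the lowercasing and simple whitespace tokenization are simplifying assumptions):</p>

```python
from collections import Counter

# Candidate 1 repeats "the" seven times; each reference contains "the"
# at most twice, so clipping caps the credit at 2.
cand = "the the the the the the the".split()
ref1 = "the cat is on the mat .".split()
ref2 = "there is a cat on the mat .".split()

cand_counts = Counter(cand)              # {'the': 7}
max_ref = Counter()
for ref in (ref1, ref2):
    for w, k in Counter(ref).items():
        max_ref[w] = max(max_ref[w], k)  # max_ref['the'] == 2

clipped = sum(min(k, max_ref[w]) for w, k in cand_counts.items())
precision = clipped / len(cand)          # 2 / 7
```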
</div>
<div class="slide" id="application">
<h1>Application</h1>
<p>Example in figure 1.</p>
<p>Conclusion: any single n-gram precision score can distinguish
a good translation from a bad translation.</p>
</div>
<div class="slide" id="application-cont-d">
<h1>Application (cont'd)</h1>
<p>Example in figure 2.</p>
<p>Conclusion:</p>
<ul class="simple">
<li>Able to distinguish translations that do not differ greatly in quality.</li>
<li>Able to distinguish two human translations of differing quality.</li>
</ul>
<p>This metric actually works well!</p>
</div>
<div class="slide" id="combining-n-gram-sizes">
<h1>Combining n-gram sizes</h1>
<p>An obvious choice: a weighted linear average.</p>
<p>However, precision decays roughly exponentially as n grows.</p>
<p>So instead use a weighted average of the logarithms of the precisions,
with uniform weights (a geometric mean).</p>
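<p>A sketch of the uniform-weight log averaging, which is equivalent to a geometric mean of the precisions (the zero-handling convention here is an assumption):</p>

```python
import math

def combine_precisions(precisions):
    """Uniformly weighted average of log precisions, i.e. a geometric mean.
    Returns 0 if any precision is 0, since log(0) is undefined."""
    if min(precisions) == 0:
        return 0.0
    w = 1.0 / len(precisions)
    return math.exp(sum(w * math.log(p) for p in precisions))
```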
</div>
<div class="slide" id="sentence-length">
<h1>Sentence length</h1>
<p>A candidate translation should be neither too long nor too short.</p>
<p>To some extent the n-gram precision already accomplishes this; however,
there are exceptions.</p>
</div>
<div class="slide" id="problem-example">
<h1>Problem example</h1>
<p>Example 3.</p>
<p>Unigram and bigram modified precisions are both 1, even though the
candidate is clearly too short.</p>
</div>
<div class="slide" id="sentence-brevity-penalty">
<h1>Sentence brevity penalty</h1>
<p>Overly long candidates are already penalized by the modified n-gram
precision.</p>
<p>A multiplicative brevity penalty additionally penalizes candidates that
are shorter than the reference translations.</p>
<p>If the candidate's length matches any reference's length, the brevity
penalty is 1.0.</p>
</div>
<div class="slide" id="id2">
<h1>Sentence brevity penalty</h1>
<p>c: the length of the candidate translation;
r: the effective reference corpus length.</p>
<pre class="literal-block">
BP = 1            if c &gt; r
     e^(1−r/c)    if c ≤ r
</pre>
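<p>The penalty can be sketched directly from the definition above:</p>

```python
import math

def brevity_penalty(c, r):
    """BP = 1 if c > r, else e^(1 - r/c), for candidate length c and
    effective reference length r."""
    if c > r:
        return 1.0
    return math.exp(1 - r / c)
```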
</div>
<div class="slide" id="bleu-evaluation">
<h1>3. BLEU evaluation</h1>
<pre class="literal-block">
BP = 1            if c &gt; r
     e^(1−r/c)    if c ≤ r
</pre>
<p>BLEU = BP * exp( sum( n in 1..N: w_n * log(p_n) ) )
(will be written on board)</p>
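<p>Putting the pieces together, a self-contained corpus-level BLEU sketch with uniform weights w_n = 1/N (taking r as the closest reference length per sentence, following the paper's best-match length; variable names and tokenization are my own):</p>

```python
import math
from collections import Counter

def ngrams(toks, n):
    return [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]

def bleu(candidates, references_list, N=4):
    """Corpus-level BLEU = BP * exp(sum_n w_n * log p_n), w_n = 1/N.
    candidates: list of token lists; references_list: one list of
    reference token lists per candidate."""
    clipped = [0] * N
    total = [0] * N
    c_len = r_len = 0
    for cand, refs in zip(candidates, references_list):
        c_len += len(cand)
        # effective reference length: reference closest in length to candidate
        r_len += min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
        for n in range(1, N + 1):
            counts = Counter(ngrams(cand, n))
            max_ref = Counter()
            for r in refs:
                for g, k in Counter(ngrams(r, n)).items():
                    max_ref[g] = max(max_ref[g], k)
            clipped[n - 1] += sum(min(k, max_ref[g]) for g, k in counts.items())
            total[n - 1] += max(len(cand) - n + 1, 0)
    if 0 in clipped:
        return 0.0  # some n had no matches; log(0) undefined
    log_p = sum(math.log(clipped[i] / total[i]) for i in range(N)) / N
    bp = 1.0 if c_len > r_len else math.exp(1 - r_len / c_len)
    return bp * math.exp(log_p)
```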
</div>
<div class="slide" id="id3">
<h1>BLEU evaluation</h1>
<p>Table 1: BLEU on 500 sentences</p>
<pre class="literal-block">
|   S1   |   S2   |   S3   |   H1   |   H2   |
| 0.0527 | 0.0829 | 0.0930 | 0.1934 | 0.2571 |
</pre>
</div>
<div class="slide" id="id4">
<h1>BLEU evaluation</h1>
<p>Table 2: Paired t-statistics on 20 blocks</p>
<pre class="literal-block">
       |  S1   |  S2   |  S3   |  H1   |  H2   |
Mean   | 0.051 | 0.081 | 0.090 | 0.192 | 0.256 |
StdDev | 0.017 | 0.025 | 0.020 | 0.030 | 0.039 |
t      |  --   |   6   |  3.4  |  24   |  11   |
</pre>
</div>
<div class="slide" id="human-evaluation">
<h1>4. Human Evaluation</h1>
<p>Two groups of human judges:</p>
<ul class="simple">
<li>10 native English speakers</li>
<li>10 native Chinese speakers who had lived in English-speaking countries for the past several years</li>
</ul>
<p>None of them was a professional translator.</p>
</div>
<div class="slide" id="human-evaluation-cont-d">
<h1>Human Evaluation (cont'd)</h1>
<p>Judged the 5 systems on a subset of the Chinese sentences from the
500-sentence corpus.</p>
<p>Each source sentence was paired with each of its 5 translations, for a
total of 250 pairs.</p>
<p>Each translation was rated from 1 (very bad) to 5 (very good); the
monolingual group judged only the translations' readability and fluency.</p>
</div>
<div class="slide" id="bleu-vs-human-evaluation">
<h1>5. BLEU vs Human Evaluation</h1>
<p>The two reference translations gave comparable results across the 5 systems.</p>
<p>Figure 5: BLEU vs monolingual</p>
<p>Figure 6: BLEU vs Bilingual</p>
<p>Figure 7: Scores for BLEU, monolingual, bilingual linearly normalised</p>
</div>
<div class="slide" id="conclusion">
<h1>6. Conclusion</h1>
<p>BLEU allows researchers to evaluate translation approaches rapidly.</p>
<p>The authors believe BLEU could be adapted to evaluating summarization
tasks.</p>
</div>
</div>
</body>
</html>
